System and Method of Using Dynamic Variance Networks

ABSTRACT

A method and system for determining locations of multiple targets in a document, comprising: creating, utilizing a localization module, information compiled from a training document, the information comprising a reference and a reference vector tying the reference to the multiple targets, wherein the multiple targets include multiple words.

This patent application claims priority and benefit as a continuation-in-part of U.S. patent application Ser. No. 12/610,915, filed Nov. 2, 2009, entitled “System and Method of Using Dynamic Variance Networks.” The entirety of the application is incorporated herein by reference.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates a system for obtaining information about at least one document, according to one embodiment.

FIGS. 2-4 illustrate a method for locating at least one target in at least one document utilizing dynamic variance networks (DVNs), according to one embodiment.

FIGS. 5-15 illustrate examples of locating at least one target in at least one document utilizing DVNs, according to several embodiments.

FIGS. 16-18 illustrate a method for locating at least one target in at least one document utilizing dynamic sensory maps (DSMs), according to one embodiment.

FIG. 19 illustrates an example of locating at least one target in at least one document utilizing DVNs, according to one embodiment.

FIG. 20 illustrates a method for obtaining information about at least one document, according to one embodiment.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

FIG. 1 illustrates a system for obtaining information about at least one document, according to one embodiment. In one embodiment, the system 100 can comprise at least one communication network 101 that connects hardware and software elements. In some embodiments, the hardware can execute the software.

The hardware can comprise at least one communications/output unit 105, at least one display unit 110, at least one centralized processing unit (CPU) 115, at least one hard disk unit 120, at least one memory unit 125, and at least one input unit 130. The communications/output unit 105 can send results of extraction processing to, for example, a screen, printer, disk, computer and/or application. The display unit 110 can display information. The CPU 115 can interpret and execute instructions from the hardware and/or software components. The hard disk unit 120 can receive information (e.g., documents, data) from CPU 115, memory unit 125, and/or input unit 130. The memory unit 125 can store information. The input unit 130 can receive information (e.g., a document image or other data) for processing from, for example, a screen, scanner, disk, computer, application, keyboard, mouse, or other human or non-human input device, or any combination thereof.

The software can comprise one or more databases 145, at least one localization module 150, at least one image processing module 155, at least one OCR module 160, at least one document input module 165, at least one document conversion module 170, at least one text processing statistical analysis module 175, at least one document/output post processing module 180, and at least one systems administration module 185. The database 145 can store information. The image processing module 155 can include software which can process images. The OCR module 160 can include software which can generate a textual representation of the image scanned in by the input unit 130 (e.g., scanner). It should be noted that multiple OCR modules 160 can be utilized, in one embodiment. The document input module 165 can include software which can work with preprocessed documents (e.g., preprocessed in system 100 or elsewhere) to obtain information (e.g., used for training). Document representation (e.g., images and/or OCR text) can be sent to the localization module 150. The document conversion module 170 can include software which can transform a document from one form to another (e.g., from Word to PDF). A text processing statistical analysis module 175 can include software which can provide statistical analysis of the generated text to pre-process the textual information. For example, information such as the frequency of words, etc. can be provided. A document/output post processing module 180 can include software which can prepare a result document in a particular form (e.g., a format requested by a user). It can also send result information to an external or internal application for additional formatting and processing. The systems administration module 185 can include software which allows an administrator to manage the software and hardware.

In one embodiment, individual modules can be implemented as software modules that can be connected (via their specific interface) and their output can be routed to modules desired for further processing. All described modules can run on one or many CPUs, virtual machines, mainframes, or shells within the described information processing infrastructure, such as CPU 115. Database 145 can be stored on hard disk unit 120.

The localization module 150 can utilize at least one document classifier, at least one dynamic variance network (DVN), at least one dynamic sensory map (DSM), or at least one fuzzy format engine, or any combination thereof. A document classifier can be used to classify a document using, for example, a class identifier (e.g., invoice, remittance statement, bill of lading, letter, e-mail; or by sender, vendor, or receiver identification). The document classifier can help narrow down the documents that need to be reviewed or to be taken into account for creating the learn sets. The document classifier can also help identify which scoring applications (e.g., DVNs, DSMs, and/or fuzzy format engines) should be used when reviewing new documents. For example, if the document classifier identifies a new document as an invoice from company ABC, this information can be used to pull information learned by the DVN, DSM, and fuzzy format engine from other invoices from company ABC. This learned information can then be applied to the new document in an efficient manner, as the learned information may be much more relevant than, for example, information learned from invoices from company BCD. The document classifier is described in more detail with respect to FIG. 20.

As mentioned above, the localization module 150 can include numerous scoring applications, such as, but not limited to, DVNs, DSMs, or fuzzy format engines, or any combination thereof. DVNs can be used for determining possible target values by using references on a document or piece of a document to determine possible locations for any targets. A score can be given for each possible target value identified by the DVN. DVNs are discussed further below with respect to FIGS. 2-15 and 20. DSMs can also be used to determine possible target values based on different known locations for the target. A score can be given for each possible target value identified by the DSM. DSMs are discussed further below with respect to FIGS. 16-20. In addition, fuzzy format engines can be utilized to identify possible target values by using a fuzzy list of formats for any targets. As with DVNs and DSMs, fuzzy format engines can give a score for any possible target values. Fuzzy format engines are discussed in more detail below with respect to FIG. 20.

Information generated by the localization module 150 can be sent to the database(s) 145 or to external inputs (e.g., input unit 130, communication network 101, hard disk unit 120, and administration module 185). The output or part of the output of the localization module 150 can be stored, presented or used as input parameters in various components (e.g., communications/output unit 105, display unit 110, hard disk unit 120, memory unit 125, communication network 101, conversion module 170, database(s) 145, OCR module 160, statistical analysis module 175) either using or not using the post-processing module 180. Such a feedback system can allow for iterative refinement.

Document Classifier

As indicated above, the document classifier can be used to classify a document using, for example, a class identifier (e.g., invoice, remittance statement, bill of lading, letter, e-mail; or by sender, vendor, or receiver identification). The document classifier can operate based on text in a document. The document classifier can also be based on positional information about text in a document. Details relating to how a document classifier can classify a document using any combination of textual and/or positional information about text from the document are explained in more detail in the following patents, which are herein incorporated by reference: U.S. Pat. No. 8,276,067; U.S. Pat. No. 6,976,207; and U.S. Pat. No. 7,509,578 (all entitled “Classification Method and Apparatus”).

Once the text information and text positional information are obtained for at least one training document, this information can be used to return an appropriate class identifier for a new document. (It should also be noted that a human or other application can provide this information.) For example, if invoices issued by company ABC are to be reviewed, certain text (e.g., “ABC”) or text positional information (e.g., where “ABC” was found to be located on training documents using, for example, DVNs or DSMs) found on the training set of documents can be searched on new documents to help determine if the new document is an invoice issued by company ABC. Documents identified as invoices issued by company ABC can be reviewed with company ABC-specific DVNs, DSMs and/or fuzzy searching machines.

It should be noted that the document classification search can be performed in a fuzzy manner. For example, punctuation or separation characters, as well as leading or lagging alphabetical characters and leading or lagging zeroes can be ignored. Thus, for example, “123-45”, “1/2345”, “0012345”, “INR1234/5” can be found if a fuzzy search is done for the string “12345”. Those of ordinary skill in the art will see that many types of known fuzzy searching applications can be used to perform the document classification search. Other examples of fuzzy representations and their respective classification are described in further detail in the following patents, which are herein incorporated by reference: U.S. Pat. No. 7,908,430; U.S. Pat. No. 6,983,345; and U.S. Pat. No. 7,433,997 (all entitled “Associative Memory”).
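By way of illustration only, the fuzzy search described above can be sketched as a normalization step followed by an exact comparison. The function names `normalize` and `fuzzy_match` are hypothetical and do not appear in the incorporated references; the sketch assumes that ignoring separators, leading or lagging letters, and leading or lagging zeroes is sufficient for the example strings.

```python
import re

def normalize(token: str) -> str:
    """Reduce a token to its core by ignoring separators, outer letters and outer zeroes."""
    # Drop punctuation and separation characters (assumption: anything non-alphanumeric).
    compact = re.sub(r"[^0-9A-Za-z]", "", token)
    # Drop leading or lagging alphabetical characters.
    core = re.sub(r"^[A-Za-z]+|[A-Za-z]+$", "", compact)
    # Drop leading or lagging zeroes.
    return core.strip("0")

def fuzzy_match(query: str, candidate: str) -> bool:
    """True when both strings share the same normalized core."""
    return normalize(query) == normalize(candidate)

candidates = ["123-45", "1/2345", "0012345", "INR1234/5", "54321"]
hits = [c for c in candidates if fuzzy_match("12345", c)]
print(hits)
```

A production fuzzy search would typically also tolerate character substitutions (e.g., via an edit distance); this sketch covers only the ignore rules listed in the text.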

As explained above, the document classifier can help narrow down the documents that need to be reviewed. The document classifier can also help identify which scoring applications (e.g., DVNs, DSMs, and/or fuzzy format engines) should be used when reviewing new documents. This learned information from the DVNs, DSMs, and/or fuzzy format engines can then be applied to the new document in an efficient manner, as the learned information may be much more relevant than, for example, information learned from invoices from company BCD.

FIG. 20 illustrates an example use of document classifiers with scoring applications. (It should be noted that document classifiers do not need to be used to narrow down the documents. It should also be noted that many other scoring applications can be utilized. Furthermore, it should be noted that other applications can be used to determine information about targets.) Referring to FIG. 20, in 2005, a document classifier is utilized to choose the most relevant scoring information. For example, if the document classifier identifies a new document as an invoice from company ABC, this information can be used to pull information learned by the DVN, DSM, and fuzzy format engine from other invoices from company ABC. In 2010, the relevant DVN, DSM and fuzzy format information (e.g., related to invoices issued by company ABC) can be applied to the classified document to obtain any possible target values along with a score for each. In 2015, validation rules can be used to narrow down the set of possible target values. For example, only possible target values for targets NET, VAT and TOTAL that satisfy the formula NET+VAT=TOTAL can be returned as filtered possible target values. Other example validation rules could include: that the date of a document has to be later than Jan. 1, 2005; or that an order number needs to be within a certain range. In 2020, the filtered possible target values are compared to each other, and the filtered possible target value with the highest score can be used as the target value. Note that in other embodiments, all filtered possible target values, or even all unfiltered possible target values could be shown to a person or fed to another application.
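As a non-limiting sketch of the validation and selection described for 2015 and 2020, the following applies the rule NET+VAT=TOTAL to scored candidates and keeps the highest-scoring consistent combination. The function name, the tolerance `tol`, and the choice to combine scores by summation are assumptions made for illustration; the description does not specify how scores of several targets are combined.

```python
from itertools import product

def best_consistent_triple(net_cands, vat_cands, total_cands, tol=0.005):
    """Each candidate is a (value, score) pair.

    Returns ((net, vat, total), combined_score) for the highest-scoring
    combination satisfying NET + VAT = TOTAL, or None if none qualifies.
    """
    best = None
    for (n, sn), (v, sv), (t, st) in product(net_cands, vat_cands, total_cands):
        if abs(n + v - t) <= tol:          # validation rule (filter step)
            combined = sn + sv + st        # assumption: scores are summed
            if best is None or combined > best[1]:
                best = ((n, v, t), combined)
    return best

net = [(100.0, 0.9), (110.0, 0.4)]
vat = [(19.0, 0.8), (10.0, 0.3)]
total = [(119.0, 0.7), (120.0, 0.6)]
result = best_consistent_triple(net, vat, total)
print(result[0])
```

Here two combinations pass the validation rule (100+19=119 and 110+10=120), and the one with the larger combined score is selected.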

Dynamic Variance Networks (DVNs)

FIG. 2 illustrates a method 200 for locating at least one target in at least one document utilizing DVNs, according to one embodiment. In 205, one or more documents (or pieces of documents) can be used for training. In 210, at least one DVN can be created from information compiled from the training set of documents. The DVN can be a set of “keyword” references (e.g., any textual/digit/character block, such as a word, number, alpha-numeric sequence, token, logo, text fragment, blank space etc.) and reference vectors for this set of references. Each reference vector can connect a reference to a target. In 215, the DVN can be applied on untrained documents to localize at least one target on the untrained documents. The localization can determine where on the untrained documents the position of the target is expected to be. This can help obtain or confirm information about the target (e.g., the target value Jan. 10, 2009 for the target “invoice date”). For example, if the target is a document field, such as a date, the value present at the target can be extracted. If there is no reference at the given target position, it can be indicated that the target is not present on the document. Example targets can include, but are not limited to: check boxes, signature fields, stamps, address blocks, fields (e.g., total amount for invoices, weight of package on delivery notes, credit card number on receipts), manual or automatically edited entries on maps, image related content in text/image mixed documents, page numbers, etc.

It should be noted that the above method 200 can provide increased redundancy and accuracy. Because every reference is a potential basis for target localization, there can be hundreds of reference anchors per page for each target. Thus, even for torn pages, where all classical keywords are missing, a target localization can be found.

In addition, it should be noted that a reference with a typo or misrecognized by an OCR engine at a particular position can automatically be used as an anchor based on where the reference is found. Thus, in some embodiments, there is no need to specify traditional keywords or apply any limitation to anchor references. In this way, strict and/or fuzzy matching can be utilized to match any similar reference to at least one reference in a new document.

Furthermore, the following characteristics of the reference can be taken into account when matching: font; font size; style; or any combination thereof. Additionally, the reference can be: merged with at least one other reference; and/or split into at least two references.

FIG. 3 illustrates details of step 210 for creating the DVN from the training set, according to one embodiment. In 305, a set of “keyword” references can be created from at least one reference found on at least one document used for training. In 310, at least one reference vector can be created for each reference.
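For illustration, steps 305 and 310 can be sketched as follows, assuming that each reference is available as a text string with page coordinates and that a reference vector is simply the offset from the reference to the target. The tuple layout and the function name `build_reference_vectors` are hypothetical.

```python
def build_reference_vectors(references, target):
    """references: list of (text, x, y); target: (x, y) position of the target.

    Returns one entry per reference: (text, (x, y), (dx, dy)), where (dx, dy)
    is the vector leading from the reference to the target.
    """
    tx, ty = target
    return [(text, (x, y), (tx - x, ty - y)) for text, x, y in references]

# One training page: three references and the known position of the target.
page_refs = [("Invoice Date", 100, 50), ("Order No.", 150, 90), ("Amount", 600, 450)]
vectors = build_reference_vectors(page_refs, target=(300, 50))
for text, pos, vec in vectors:
    print(text, pos, vec)
```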

FIG. 5 illustrates a view of a document, where gray areas 510 denote different references that could be used as the set of “keyword” references. The reference vectors 515 are lines from each reference to a particular target 505. Different shades of gray can indicate different content. For example, the darker gray could represent content that is word content. As another example, the lighter gray could represent content that is a number or a combination of numbers and letters. Additional examples of content include, but are not limited to: number and punctuation strings, OCR-misrecognized characters (e.g., “/(!*7%8[]4$2§” for part of a stamp on an image), words in different languages, words found in dictionaries, words not found in dictionaries, different font types, different font sizes, different font properties, etc.

In 315, variance filtering can be performed by selecting similar reference vectors. The variance filtering can compare the references and the reference vectors for all documents in the learn set, compare the type of references, and keep similar reference vectors. Similar reference vectors can be position similar, content similar, and/or type similar with respect to the reference. A reference can be positionally similar when the reference is usually found in one or more particular places on a page. Content similarity relates to references having the same type of content (e.g., when the references are all the same word or similar words). Type similarity relates to the reference usually being a particular type (e.g., a numerical value, a word, a keyword, a font type, etc.). Similarity types can be tied to other similarity types (e.g., when the references are all content similar, the same word or similar words, but only when the references are type similar as well (e.g., all of the type “date”)).

It should be noted that the reference consistency tests can be fuzzy. An example of fuzzy testing with a positionally similar reference is when everything within a defined x and y coordinate space is utilized, and the space parameters are able to be adjusted. Content consistency can be determined by comparing words. Thus, “Swine-Flu”, “swineflu”, “Schweinegrippe” and “H1N1” can be assumed to be identical for a special kind of fuzzy comparison. “Invoice Number”, “Inv0!ce No.” and “invoiceNr” can be assumed to be identical for another kind of fuzzy comparison. An example of type similar fuzzy testing is when more than one type can be used (e.g., both “number” type and “number/letter” type for a date).

In 320, the similar reference vectors are used to create the DVN. For example, FIG. 6 illustrates the DVNs (i.e., reference vectors for the “keyword” references) for six documents. The six documents illustrate the variability in terms of references and positions across different documents and its effect on the reference vectors.

FIG. 7 illustrates the variance filtering 315 (e.g., overlaying) of all six documents from FIG. 6. 705 illustrates the reference vectors in FIG. 6 on one stack. The variability and consistency of the reference vectors is indicated with the darkness of the lines. The darker the line on FIG. 7, the more often the reference vector was found when overlaying the documents. 710 illustrates the effect of a consistency filter on the reference vectors. The minimal amount of consistency across the reference vectors and the documents can be configurable and can have a value between 1 (meaning every reference vector is kept) and N (the number of documents in the current set, meaning only the reference vectors present on all documents are considered useful). For example, if the selected value for the consistency is 5, and the number of documents is 7, the similar vector for one specific word at a specific position must be found on 5 out of 7 documents to keep this reference vector.
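The configurable consistency value described above can be sketched as a simple counting filter. In this illustrative sketch, fuzzy positional similarity is approximated by rounding each vector to a grid; the `grid` parameter, the data layout, and the function name `consistency_filter` are assumptions rather than part of the described embodiments.

```python
from collections import Counter

def consistency_filter(docs, min_count, grid=10):
    """docs: one list per training document of (text, (dx, dy)) reference vectors.

    Keeps every (text, grid_dx, grid_dy) key that occurs on at least
    min_count documents (1 keeps everything, len(docs) keeps only vectors
    present on all documents).
    """
    counts = Counter()
    for vectors in docs:
        # Count each key at most once per document.
        seen = {(text, round(dx / grid), round(dy / grid)) for text, (dx, dy) in vectors}
        counts.update(seen)
    return {key for key, n in counts.items() if n >= min_count}

# Seven documents: five carry "Invoice Date" with slight jitter, two carry "Page".
docs = [[("Invoice Date", (200 + j, j))] for j in (-2, -1, 0, 1, 2)]
docs += [[("Page", (50, 300))], [("Page", (50, 300))]]

kept = consistency_filter(docs, min_count=5)
print(kept)
```

With a consistency value of 5 out of 7 documents, only the "Invoice Date" vector survives; with a value of 1, both vectors are kept.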

It should be noted that the content, position and type of reference can be used to filter reference vectors and construct the DVN, especially when only totally similar reference vectors are used. FIG. 9 illustrates an example result when only fully similar (e.g., the reference vectors are similar (e.g., lining up) or similar in a fuzzy manner (e.g., almost lining up, “almost” being a preset variance) in all documents in the learn set) reference vectors are kept. References 905 have maximum stability (e.g., content, position and type similar), and, in one embodiment, could be represented in a first color. References 910 are only stable with respect to position and type, and, in one embodiment, can be shown in a second color. References that are stable in none of position, content, or type are not shown in FIG. 9.

Note that the image of a reference can be blurry in some situations because identical content with small positional changes can render the words readable but blurry. When the content is not the same (e.g., numbers for the invoice date, invoice number, order date and order number), the content may be unreadable in the overlay. As shown in FIG. 8, 810 illustrates the variability of the content and its impact on the variability filtering (e.g., when each document in the learn set is overlaid on one another), according to one embodiment. In 815, enlarged versions of words having low content variance are shown. In 820, enlarged versions of words having high content variance are shown. In one embodiment, content with no or low variance can be considered as more valuable information for building the dynamic variance network, due to content stability. More variable content (e.g., date) can be referred to as unstable reference points and can be considered less important.

FIG. 4 illustrates details of applying the DVN for target localization on untrained documents 215, according to one embodiment. In 405, all references on the document to be processed are compared with a DVN “keyword” reference list to determine which references are most relevant. The DVN “keyword” list is a list of references consistently found by the training. In one embodiment, only references found in all the documents used for training are used on the DVN “keyword” reference list. In another embodiment, references found in most of the documents used for training can be used.

For example, using the example of 710, 805 and 810 of FIGS. 7 and 8, the similar references from the training could include the following word-type references (designated by a dark gray): “Invoice No.”, “Invoice Date”, “Order No.”, “Order Date”, “Description”, and “Amount”. Variants of these references (e.g., Order Number instead of Order No.) can also be used. The similar references from the training could also include number or number/letter character strings (designated by a light gray) of the form: ______ (for Date), ______ (for Invoice No.), ______ (for Order No.), and ______ (for Order Date).

In 410, all of the reference vectors that relate to the “keyword” references can be used to point towards the target. In 415, the pointer information from all of the reference vectors and the reference keywords can then be integrated to localize (determine) the target.

For example, in FIG. 10, in 1005, all the references for a document are shown. In 1010, the references after the positional consistency filter is applied are shown. In 1015, the reference vector information from these references from various documents is applied and compared. In 1020, the similar reference vectors are used to determine the locality of the target.

Once possible positions for the locality of any targets are found using the DVNs, possible values for the targets can be found (e.g., Jan. 10, 2009 as the value for the target “invoice date”). Each possible value for the target can be given a score. The score can be determined by the ratio of the reference vectors hitting the target against the reference vectors not pointing to the target. Additionally, the fuzzy edit distance between the learned reference(s) (e.g., text) and the reference(s) used for localization can be integrated as a weight. For example, if all possible reference words on a document could be found exactly at the same relative position from the target as the ones stored in the learn set, the highest score can be returned. Additional references not contained in the learn set, or references with no vectors pointing towards the respective target can reduce the score.
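Steps 405 through 415 and the scoring described above can be sketched as a voting scheme. This is an illustrative simplification: references are matched by exact text, predicted target positions are bucketed into grid cells, and the score is taken as the fraction of votes that hit the winning cell rather than the hit-to-miss ratio with edit-distance weighting. The names `localize` and `grid` are hypothetical.

```python
from collections import Counter

def localize(dvn, page_refs, grid=10):
    """dvn: {reference text: (dx, dy)} learned reference vectors.
    page_refs: list of (text, x, y) references found on the new document.

    Returns (grid_cell, score) for the most-voted target cell, or None when
    no reference on the page matches the DVN keyword list.
    """
    votes = Counter()
    for text, x, y in page_refs:
        if text in dvn:                      # 405: compare with the keyword list
            dx, dy = dvn[text]               # 410: follow the reference vector
            votes[((x + dx) // grid, (y + dy) // grid)] += 1
    if not votes:
        return None
    cell, hits = votes.most_common(1)[0]     # 415: integrate the pointer information
    return cell, hits / sum(votes.values())  # simplified score: share of hitting vectors

dvn = {"Invoice Date": (200, 0), "Order No.": (150, -40), "Amount": (-300, -400)}
page = [("Invoice Date", 100, 50), ("Order No.", 152, 93),
        ("Amount", 700, 900), ("Total", 1, 1)]
cell, score = localize(dvn, page)
print(cell, round(score, 2))
```

In this example two of three matched references agree on the same cell, while the displaced "Amount" reference votes elsewhere and lowers the score.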

It should be noted that DVNs can be used for many additional tasks, including, but not limited to: the addition of reference vectors, reference correction, document classification, page separation, recognition of document modification, document summarization, or document compression, or any combination thereof. These tasks are explained in more detail below.

Addition and/or removal of Reference Vectors. DVNs can be dynamically adapted after target localization. When at least one reference vector is learned and used to localize a target, all other possible reference vectors can be created and dynamically added to the DVN learned in 210 of FIG. 2. Furthermore, outdated reference vectors (e.g., reference vectors not used for a long time, or filtered out) can be removed. This can allow continuous updating of the reference vectors from all processed documents. Such a continuous updating procedure can update and change the DVN during document processing.

Reference Correction. Reference vectors can be used for reference correction. An example is illustrated in FIG. 11. At 1105, one learn document containing one target 1107 and three anchor references (“991826”, “!8%!”, “example”) is shown. The respective reference vectors 1115 from the references to the target are also shown. After learning, the set of reference vectors 1115 is matched on a different document 1130. On this document 1130, the reference “example” is corrupted and spelled “Example”. However, due to its location, “Example” can be matched to “example” and be replaced in 1140. This ability can help result in reference correction on processed documents.

Another example of reference vectors being used for reference correction is when the reference vectors are used to locate a target of, for example, a specific type. Additional information present can then be used to correct a potentially corrupted target. For example, if the reference vectors point towards the reference “29 Sep. 1009”, and this reference is known to be a date field target from a currently retrieved document, then a correction of that target to “29 Sep. 2009” is possible. To do this correction, the high similarity between “September” and “September” is used in a fuzzy content comparison and additional information about the entry being a date can be used to correct the year to a (configurable) time period that seems to be valid. It should also be noted that, if a date field target is clearly located, then the reference vectors can be followed back to the potential anchor references. If, for example, the position and information for such an anchor reference fit perfectly, then the actual reference present there, but not fitting the anchor reference present in the learned DVN, could be replaced by the one from the learned DVN. For example, if the invoice number field target was located, the surrounding classical keyword which is corrupted and shows “Inv0!ce Number” could be replaced by the one stored for this position from the learned DVN. Thus, after that correction, “Invoice Number” could be read at that position.

Document Classification. As explained earlier with respect to FIG. 1, the learned DVN can also be used for document classification, as shown in FIG. 12. Two documents (1205 a and 1205 b), with references anchored on the targets of the documents (1210 a and 1210 b) are shown. The reference vectors for document 1205 a point to anchor reference words. For document 1205 b, some of the reference vectors point to anchor reference whitespace. The quality of fit of the learned DVN can be measured and serve as an indicator as to whether the present document is from the same “category” or “class” as the one where the learned DVN was trained. In a many-class scenario for such an application, for all collected DVNs, the overlap of the reference vectors on one target area can be measured. A high overlap of many reference vectors indicates that the anchor words may be in a similar relative position to one or many targets. This high overlap information can be used as information to determine from which class or set of documents the DVN was created.

Page Separation. Positional information regarding anchor references can also be used for page separation. In a stack of different documents (e.g., single documents, multi-page documents), the changes in the DVN's positional information (also referred to as “quality of fit”) can provide information about the starting page of a new document. This method can be used to, for example, repackage piles of documents into single documents.

Recognition of Document Modification. DVNs can also be used in a reverse manner (e.g., after having located a target, looking up how well the anchor words on the present document fit to the learned anchor words of the DVN), to recognize a document modification. For example, in FIG. 13, one document (1300 a) is learned (e.g., the DVN is created for at least one target) and then this DVN is matched later onto the potentially edited document (1300 b) to detect modifications. There are three basic types of modification: 1) a reference vector points to a reference that has the same position, but changed content (1310); 2) the reference vector points to whitespace (1320), indicating that the reference there may have been deleted or moved; and 3) there are references with no reference vectors (e.g., these may be added words 1330). Such modifications can include, but are not limited to: an exchange of words, a rephrasing of words, a removal of document parts, changes in document layout, font size or font style. Additionally, the comparison of several DVNs for different targets on one document can allow for precise “fingerprinting”, essentially giving a robust and sensitive method to detect any non-typical changes in the document. For example, frequent changes in the revision number for contracts can be ignored while changes in wording can be highlighted. The option to return where and what was changed can be provided.
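The three basic types of modification can be sketched as a comparison between the learned references and those found on the potentially edited document. The sketch below assumes, for illustration, that each document is reduced to a mapping from exact position to reference text; a practical embodiment would use fuzzy positional matching. The function name `detect_modifications` is hypothetical.

```python
def detect_modifications(learned, edited):
    """learned, edited: {(x, y): reference text}.

    Returns three dicts:
      changed: same position, different content (type 1)
      missing: learned position is now whitespace (type 2, deleted or moved)
      added:   reference with no learned reference vector (type 3)
    """
    changed = {pos: (learned[pos], edited[pos])
               for pos in learned if pos in edited and edited[pos] != learned[pos]}
    missing = {pos: text for pos, text in learned.items() if pos not in edited}
    added = {pos: text for pos, text in edited.items() if pos not in learned}
    return changed, missing, added

learned_doc = {(10, 10): "Contract", (10, 50): "Rev. 1", (10, 90): "Seller"}
edited_doc = {(10, 10): "Contract", (10, 50): "Rev. 2", (200, 90): "Buyer"}
changed, missing, added = detect_modifications(learned_doc, edited_doc)
print(changed, missing, added)
```

The returned dictionaries indicate both where and what was changed, which corresponds to the reporting option mentioned above.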

DVN for Multiple Targets. DVNs can also be used to localize multiple targets on a document in some example embodiments. In this regard, a multi-target area that may, for example, contain multiple words is trained. Examples of multiple word targets include, but are not limited to, addresses, terms and conditions, phrases, full names, and content of free-form fields in forms. The relative position of a bounding box or bounding area (formed, for example, by polygons, splines or other shapes forming a circumference around at least one target, typically comprising all desired targets) is then anchored to the document with the method described above for localization of single targets. In some example embodiments, the bounding box or bounding area may extend over multiple rows. The two dimensional statistics of the spread, size, and probability density of target localizations within the area covered by the bounding area are trained based upon the statistics of the learn-set documents presented to the system. There, positional information, variance across positions, frequency of targets, and their abstracted content (e.g., via a character to format transformation) play a role in how to determine the area which is learned. For target localization purposes, the statistically enriched bounding area is positioned on the document by the anchoring method described above for the DVN target localization. Then potential targets desired for extraction within this area are selected based on the learned statistics for the bounding area and returned as a multi-target localization. Such a return value typically comprises more than one word as a result.

Document Summarization. DVNs can also be used to automatically summarize document content. This process is illustrated in FIG. 14. In this example, two documents (1400 a and 1400 b) are used as inputs, two DVNs are created, and these two DVNs are analyzed for their variability. The variance is shown in 1420 as a slightly shifted (for visual aid) overlap of the two DVNs. Note the positional and possible content variability of the references. An example of content variability that also applies to this case is shown in FIG. 9, where 905 shows stable content and 910 shows content with a certain variance. Based on this information, two summaries can be constructed: a stable summary (1430), which keeps only similar references, and a variable summary (1440), which keeps changing references. The (low variance) stable reference vectors to any target on a document can represent the “form” or “template” of the document. The (high variance) variable reference vectors can indicate individual information per document, and can thus be valuable for automatic summarization.

Document Compression. DVNs can also be used for compression of a document or set of documents. In FIG. 15, document compression is illustrated for four different documents (1500 a, 1500 b, 1500 c, 1500 d) and their respective DVNs. In the uncompressed case (1501), all four documents have to be stored. In the compressed case (1520), only the stable DVN (shown in 1510) and the deviations from that DVN (1505 a, 1505 b, 1505 c, 1505 d, 1505 e), with the respective positions on the documents for each of the non-DVN mapped words, have to be stored. For example, 1505 a could be the string “Management-Approved” at document coordinates +1902×+962 relative to the top left corner of the document. Such variable information can be stored for 1505 b, 1505 c, 1505 d, and 1505 e. This can be seen as the application of a delta compression algorithm on the basis of the DVN. In this scenario, the DVNs and the deviations from the DVNs are stored separately; thus, the redundancy of the DVNs reduces the amount of data to be stored over many documents. Moreover, all aforementioned DVN applications can be used on compressed documents as well, without the need to unpack them.
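The delta-compression idea can be sketched in a few lines. The sketch assumes each document is represented as a mapping from (x, y) coordinates to the word at that position and treats the (position, word) pairs shared by all documents as the stable part; the function names and this simplified representation are assumptions, not the DVN structure itself.

```python
def compress(docs):
    """Split documents into one shared stable part and per-document deviations.

    docs: list of dicts mapping (x, y) -> word.
    """
    # Stable part: (position, word) pairs that occur in every document.
    stable = set(docs[0].items()).intersection(*(d.items() for d in docs[1:]))
    # Deviations: whatever each document holds beyond the stable part.
    deltas = [
        {pos: word for pos, word in d.items() if (pos, word) not in stable}
        for d in docs
    ]
    return dict(stable), deltas


def decompress(stable, delta):
    """Rebuild one document from the stable part and its deviations."""
    doc = dict(stable)
    doc.update(delta)
    return doc


docs = [
    {(100, 50): "Invoice", (300, 50): "Date", (1902, 962): "Management-Approved"},
    {(100, 50): "Invoice", (300, 50): "Date", (1850, 940): "Pending"},
]
stable_part, deltas = compress(docs)
restored = [decompress(stable_part, delta) for delta in deltas]
```

The stable part is stored once, so the saving grows with the number of documents that share it.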

Dynamic Sensory Maps (DSMs)

FIG. 16 illustrates a method for locating at least one target in at least one document utilizing DSMs, according to one embodiment. In 1610, one or more documents (or pieces of documents) can be used for training. In 1620, at least one DSM can be created from information compiled from the training. The DSM can be a set of possible locations for at least one target. In 1630, the DSM can be applied on untrained documents to locate the target using the target's possible locations.

FIG. 17 illustrates details related to creating the DSM in 1620, according to one embodiment. In 1710, the at least one target is identified. In 1720, the probability for the most likely position of the target is determined. If the target location is from the first document in a set of training documents, such target location can be used as the probable location of the target. As further training documents are analyzed, the possible target locations can be increased to include other locations. The probability for each possible target location can also be determined by counting the frequency of the target being found at that location (e.g., 7 times in 10 documents). The probability for each possible target location can thus be increased or reduced (e.g., resembling un-learning or inclusion of counter-examples) as additional documents are reviewed.
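The frequency counting described above can be sketched as follows. Binning positions into a grid of fixed-size cells is an assumption made here so that nearby positions count as the same location; the function name and the cell size are hypothetical.

```python
from collections import Counter


def build_dsm(target_positions, cell=50):
    """Estimate a probability for each possible target location.

    target_positions: one (x, y) target position per training document.
    cell: assumed grid cell size used to group nearby positions.
    """
    # Count how often the target was found in each grid cell.
    counts = Counter((x // cell, y // cell) for x, y in target_positions)
    total = len(target_positions)
    # Relative frequency per cell, e.g. 7 hits in 10 documents -> 0.7.
    return {location: hits / total for location, hits in counts.items()}


# Ten training documents: the target sits in the same cell in seven of them.
positions = [(410, 705)] * 7 + [(415, 610)] * 3
dsm = build_dsm(positions)
```

Rebuilding the map after further documents are added raises or lowers each location's probability, which mirrors the increase or reduction described in the text.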

FIG. 19 illustrates an example of creating the DSM. For three different documents (1910 a, 1910 b, 1910 c), the location of the targets (1940 a, 1940 b, 1940 c) is determined. Gray boxes indicate other potential targets or references on the documents. In 1950, the three documents (1910 a, 1910 b, 1910 c) are overlaid in a manner such that the document boundaries are aligned. The respective DSM is shown in 1970, where the different gray levels of 1980 can indicate different possible locations for the targets. The DSM of 1970 also indicates two different axes (1985 and 1990), so that possible locations of a target can be used on other documents in a systematic manner (e.g., using its relative position on an x and y axis). For example, for a “total amount” target on invoices, it can be determined that the position along the 1985 axis can be more reliable than along the 1990 axis. This type of information can be taken into account as a secondary criterion for the sorting of the potential candidates for a target during extraction.
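One possible way to judge which axis is more reliable is to compare the positional variance of the target along each axis across the training documents. The sketch below makes that assumption explicit; the text does not specify how reliability is measured, and the function name and the returned keys are hypothetical.

```python
from statistics import pvariance


def axis_reliability(positions):
    """Compare how much the target position varies along each axis.

    positions: list of (x, y) target positions from the training documents.
    Lower variance is treated here as higher reliability (an assumption).
    """
    xs, ys = zip(*positions)
    var_x, var_y = pvariance(xs), pvariance(ys)
    return {
        "x_variance": var_x,
        "y_variance": var_y,
        "more_reliable": "x" if var_x <= var_y else "y",
    }


# A "total amount" field that keeps its column but moves up and down the page.
report = axis_reliability([(400, 700), (402, 760), (398, 640)])
```

The result could then serve as a tie-breaker when candidates with similar probabilities are sorted.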

FIG. 18 illustrates details related to applying the DSM in 1630, according to one embodiment. In 1810, the DSM is overlaid onto the document where a target is to be localized. In 1820, the possible position(s) of the target (along with the probability for each possible position) is obtained from the DSM. In 1830, these possible positions can be sorted so that the position with the highest probability can be deemed to be the position of the target. Once the position of the target is determined, information about the target (e.g., an amount listed in the “total amount” field) can be found.
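Steps 1810 through 1830 can be sketched as a lookup followed by a sort. The sketch assumes the DSM is a mapping from grid cells to probabilities and that candidate words with their positions have already been extracted from the document; the names and the grid representation are hypothetical.

```python
def locate_target(dsm, candidates, cell=50):
    """Rank candidate words by the DSM probability of their position.

    dsm:        dict mapping (column, row) grid cells -> probability
    candidates: list of (text, x, y) tuples found on the new document
    cell:       assumed grid cell size, matching the one used to build the DSM
    """
    # 1810/1820: overlay the DSM and read the probability at each candidate.
    scored = [
        (dsm.get((x // cell, y // cell), 0.0), text) for text, x, y in candidates
    ]
    # 1830: sort so the most probable position comes first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


dsm = {(8, 14): 0.7, (8, 12): 0.3}
candidates = [("12.00", 100, 100), ("480.00", 420, 710), ("456.00", 415, 610)]
ranking = locate_target(dsm, candidates)
best_probability, best_text = ranking[0]
```

The first entry of the ranking is deemed to be the target; the remaining entries stay available as fallback candidates.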

Fuzzy Format Engines

Fuzzy format engines can collect a list of fuzzy formats for at least one target from training documents. During the extraction phase, the fuzzy format engine can calculate a score for the matching of the learned formats to the potential target. For example, given the target value “102.65$” for an amount type target, the fuzzy format engine could learn from the training documents that, in the representation “ddd.ddR”, d represents a digit and R represents a currency sign. If the fuzzy format engine then finds a string “876.27$”, then this string can be determined to be a potential target value with a very high score (e.g., 10). However, if the string “1872,12$” is found, the score could be reduced by one for the additional digit, and reduced by another one for the comma instead of the period, for a score of 8. As another example, a fuzzy format engine could learn that “INVNR-10234” could be represented as “CCCCC-ddddd”, where C represents capital characters and d represents digits. Those of ordinary skill will see that many types of fuzzy format engines can be used, and there can also be many types of scoring utilized. Examples of other possible scoring systems are: the different handling of missing or additional characters and digits (e.g., having a 0.125 score penalty per missing or additional character vs. a 0.25 penalty for a missing or additional digit); and character string similarity measures that can be obtained as described in the following patents, which are herein incorporated by reference: U.S. Pat. No. 7,908,430; U.S. Pat. No. 6,983,345; and U.S. Pat. No. 7,433,997 (all entitled “Associative Memory”).
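The amount example above can be reproduced with a short sketch. It assumes one particular scoring scheme: each string is abstracted into a format string, and one point is deducted from a maximum of 10 for every single-character insertion, deletion, or substitution (the Levenshtein distance) needed to reach the closest learned format. The function names, the set of currency symbols, and the choice of edit distance are assumptions; the text leaves the scoring method open.

```python
CURRENCY = set("$€£¥")  # assumed set of currency symbols


def to_format(value):
    """Abstract a string: digits -> d, capitals -> C, currency -> R, rest kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("d")
        elif ch.isupper():
            out.append("C")
        elif ch in CURRENCY:
            out.append("R")
        else:
            out.append(ch)
    return "".join(out)


def edit_distance(a, b):
    """Levenshtein distance between two format strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def format_score(candidate, learned_formats, max_score=10):
    """Score a candidate against the closest learned format (one point per edit)."""
    fmt = to_format(candidate)
    penalty = min(edit_distance(fmt, learned) for learned in learned_formats)
    return max(0, max_score - penalty)


learned = [to_format("102.65$")]             # "ddd.ddR"
exact = format_score("876.27$", learned)     # same format, no penalty: 10
fuzzy = format_score("1872,12$", learned)    # extra digit and comma: 8
```

A weighted variant, such as the 0.125 and 0.25 penalties mentioned above, would replace the unit costs inside `edit_distance` with per-class costs.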

While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant art(s) that various changes in form and detail can be made therein without departing from the spirit and scope of the present invention. Thus, the present invention should not be limited by any of the above-described exemplary embodiments.

In addition, it should be understood that the figures described above, which highlight the functionality and advantages of the present invention, are presented for example purposes only. The architecture of the present invention is sufficiently flexible and configurable, such that it may be utilized in ways other than that shown in the figures.

Further, the purpose of the Abstract of the Disclosure is to enable the U.S. Patent and Trademark Office and the public generally, and especially the scientists, engineers and practitioners in the art who are not familiar with patent or legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of the technical disclosure of the application. The Abstract of the Disclosure is not intended to be limiting as to the scope of the present invention in any way.

Finally, it is the applicant's intent that only claims that include the express language “means for” or “step for” be interpreted under 35 U.S.C. 112, paragraph 6. Claims that do not expressly include the phrase “means for” or “step for” are not to be interpreted under 35 U.S.C. 112, paragraph 6.

1. A method for determining locations of multiple targets in at least one document, comprising: creating, utilizing a localization module, information compiled from at least one training document, the information comprising a reference and a reference vector tying the reference to the multiple targets, wherein the multiple targets includes multiple words.
2. A method for determining locations of multiple targets in at least one document, comprising: creating, utilizing a localization module, information compiled from at least one training document, the information comprising a reference and a reference vector tying the reference to the multiple targets, wherein the reference extends over multiple rows in the training document.